microRNA expression (BCGSC RPKM)

The goal of this notebook is to introduce you to the microRNA expression BigQuery table.

This table contains all available TCGA Level-3 microRNA expression data produced by BCGSC's microRNA pipeline using the Illumina HiSeq platform, as of July 2016. The most recent archive (eg bcgsc.ca_THCA.IlluminaHiSeq_miRNASeq.Level_3.1.9.0) for each of the 32 tumor types was downloaded from the DCC, and data extracted from all files matching the pattern %.isoform.quantification.txt. The isoform-quantification values were then processed through a Perl script provided by BCGSC which produces normalized expression levels for mature microRNAs. Each of these mature microRNAs is identified by name (eg hsa-mir-21) and by MIMAT accession number (eg MIMAT0000076).

In order to work with BigQuery, you need to import the python bigquery module (gcp.bigquery) and you need to know the name(s) of the table(s) you are going to be working with:


In [5]:
import gcp.bigquery as bq
miRNA_BQtable = bq.Table('isb-cgc:tcga_201607_beta.miRNA_Expression')

From now on, we will refer to this table using this variable ($miRNA_BQtable), but we could just as well explicitly give the table name each time.

Let's start by taking a look at the table schema:


In [6]:
%bigquery schema --table $miRNA_BQtable


Out[6]:

Now let's count up the number of unique patients, samples and aliquots mentioned in this table. We will do this by defining a very simple parameterized query. (Note that when using a variable for the table name in the FROM clause, you should not also use the square brackets that you usually would if you were specifying the table name as a string.)


In [7]:
%%sql --module count_unique

DEFINE QUERY q1
SELECT COUNT (DISTINCT $f, 25000) AS n
FROM $t

In [8]:
fieldList = ['ParticipantBarcode', 'SampleBarcode', 'AliquotBarcode']
for aField in fieldList:
  field = miRNA_BQtable.schema[aField]
  rdf = bq.Query(count_unique.q1,t=miRNA_BQtable,f=field).results().to_dataframe()
  print " There are %6d unique values in the field %s. " % ( rdf.iloc[0]['n'], aField)


 There are  10245 unique values in the field ParticipantBarcode. 
 There are  11015 unique values in the field SampleBarcode. 
 There are  11077 unique values in the field AliquotBarcode. 
We can do the same thing to look at how many unique microRNAs exist in the table:

In [9]:
fieldList = ['mirna_id', 'mirna_accession']
for aField in fieldList:
  field = miRNA_BQtable.schema[aField]
  rdf = bq.Query(count_unique.q1,t=miRNA_BQtable,f=field).results().to_dataframe()
  print " There are %6d unique values in the field %s. " % ( rdf.iloc[0]['n'], aField)


 There are    965 unique values in the field mirna_id. 
 There are   1222 unique values in the field mirna_accession. 

These counts show that the mirna_id field is not a unique identifier and should be used in combination with the MIMAT accession number.

Another thing to note about this table is that these expression values are obtained from two different platforms -- approximately 15% of the data is from the Illumina GA platform, and 85% from the Illumina HiSeq:


In [10]:
%%sql

SELECT
  Platform,
  COUNT(*) AS n
FROM
  $miRNA_BQtable
GROUP BY
  Platform
ORDER BY
  n DESC


Out[10]:
Platformn
IlluminaHiSeq23226554
IlluminaGA3536468

(rows: 2, time: 1.7s, 390MB processed, job: job_mz2XhfqdyCl9bBTe_6-N-l-aD7s)